Tools to handle trait macroecological datasets in R
The rmacroRDM package contains functions to help with the compilation of macroecological datasets. It compiles datasets into a master long database of individual observations, matched to a specified master species list. It also checks, separates and stores taxonomic and metadata information on the observations, variables and datasets contained in the data. It therefore aims to ensure full traceability of datapoints and robust quality control, all the way through to the extracted analytical datasets.
For more details and context, see the rmacroRDM GitHub repo (https://github.com/annakrystalli/rmacroRDM)
R code for this workflow available here
First source the rmacroRDM functions. Currently the best way is to source them directly from GitHub using RCurl::getURL().
WARNING: Repo under continuous development
require(RCurl)
eval(parse(text = getURL("https://raw.githubusercontent.com/annakrystalli/rmacroRDM/master/R/functions.R", ssl.verifypeer = FALSE)))
eval(parse(text = getURL("https://raw.githubusercontent.com/annakrystalli/rmacroRDM/master/R/wideData_function.R", ssl.verifypeer = FALSE)))
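The eval(parse(...)) calls above place every sourced object in the calling environment and switch off SSL peer verification. As a hedged alternative, the sketch below downloads a script to a temporary file and sources it into a dedicated environment. The helper name source_into_env and the stand-in script are hypothetical and not part of rmacroRDM.

```r
# Source an R script from a local path or an http(s) URL into its own
# environment, so the sourced functions do not clutter the global workspace.
# NOTE: source_into_env() is a hypothetical helper, not an rmacroRDM function.
source_into_env <- function(src, envir = new.env(parent = globalenv())) {
  script <- src
  if (grepl("^https?://", src)) {
    script <- tempfile(fileext = ".R")
    # download.file() keeps SSL verification switched on
    download.file(src, destfile = script, quiet = TRUE)
  }
  sys.source(script, envir = envir)
  envir
}

# Demonstration with a local stand-in script (hypothetical content)
demo_script <- tempfile(fileext = ".R")
writeLines("hello_rdm <- function() 'functions loaded'", demo_script)
rdm <- source_into_env(demo_script)
rdm$hello_rdm()
```

The same helper could be pointed at the raw GitHub URLs used above. Functions would then be called as rdm$name(), or made available with attach(rdm).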
Next, to initialise the project we need to supply valid paths to the project folders (script.folder and data.folder in the call below).
The function sets those directories in the global environment.
setDirectories(script.folder = "~/Documents/workflows/Brain_size_evolution/",
data.folder = "~/Google Drive/Brain evolution/",
envir = globalenv())
Once project folders have been set, we set up the file system by creating the required folders (if they don’t exist already) in the project folders.
setupFileSystem(script.folder = "~/Documents/workflows/Brain_size_evolution/",
data.folder = "~/Google Drive/Brain evolution/")
In this step, we initialise the environment with some required parameters to build the database and process files to it. The call below shows the default initialisation settings, which you would get if you just called init_db()
init_db(var.vars = c("var", "value", "data.ID"),
match.vars = c("synonyms", "data.status"),
meta.vars = c("qc", "observer", "ref", "n", "notes"),
taxo.vars = c("genus", "family", "order"),
spp.list_src = NULL)
I want to use “D0” as the file from which to extract the spp.list later on, so I set spp.list_src = "D0".
init_db(spp.list_src = "D0")
configuring:
master.vars
match.vars
meta.vars
spp.list_src
taxo.vars
var.vars
NULL
The function appends the given arguments to the environment master_config at position 2 in the search path (note that the global environment is at position 1).
Here’s a list of the values of the objects we just attached as configurations:
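To illustrate the mechanism, here is a minimal sketch of attaching a configuration list to the search path with base R's attach(). It is an assumption about how such a step could work, not the code of init_db(). The list simply repeats the settings used above.

```r
# Configuration values as passed to init_db() above
config <- list(var.vars     = c("var", "value", "data.ID"),
               match.vars   = c("synonyms", "data.status"),
               meta.vars    = c("qc", "observer", "ref", "n", "notes"),
               taxo.vars    = c("genus", "family", "order"),
               spp.list_src = "D0")

# Attach the list at position 2 of the search path under the name
# "master_config" (the global environment stays at position 1)
attach(config, pos = 2, name = "master_config")

search()[1:2]        # first two entries of the search path
ls("master_config")  # names of the attached configuration objects
meta.vars            # attached objects are now found by name
```

Because the objects sit on the search path rather than in the global environment, rm(list = ls()) does not remove them. detach("master_config") does.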
Next we set up the folders in input.folder/pre/ and post/ according to the configurations set in the previous step.
If the correct setup already exists, no action is taken:
setupInputFolder(input.folder)
dirs <- list.dirs(paste(input.folder, "pre/", sep = ""), full.names = T)
print(dirs)
[1] "/Users/Anna/Google Drive/Brain evolution/inputs/data/pre/"
[2] "/Users/Anna/Google Drive/Brain evolution/inputs/data/pre//csv"
[3] "/Users/Anna/Google Drive/Brain evolution/inputs/data/pre//n"
[4] "/Users/Anna/Google Drive/Brain evolution/inputs/data/pre//notes"
[5] "/Users/Anna/Google Drive/Brain evolution/inputs/data/pre//observer"
[6] "/Users/Anna/Google Drive/Brain evolution/inputs/data/pre//qc"
[7] "/Users/Anna/Google Drive/Brain evolution/inputs/data/pre//ref"
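The "no action if the setup already exists" behaviour can be sketched in base R as follows. This is an illustration under assumptions, not the implementation of setupInputFolder(). The function name setup_input_sketch is hypothetical, and a temporary directory stands in for the real input.folder.

```r
# Create pre/ and post/ subfolders for the data (csv) and each meta.var,
# skipping any folder that already exists.
# NOTE: hypothetical helper, not part of rmacroRDM.
setup_input_sketch <- function(input.folder,
                               subfolders = c("csv", "n", "notes",
                                              "observer", "qc", "ref")) {
  created <- character(0)
  for (stage in c("pre", "post")) {
    for (sub in subfolders) {
      path <- file.path(input.folder, stage, sub)
      if (!dir.exists(path)) {
        dir.create(path, recursive = TRUE)
        created <- c(created, path)
      }
    }
  }
  invisible(created)  # paths of the folders that were actually created
}

# Temporary stand-in for input.folder
demo.folder <- file.path(tempfile("rdm_demo_"), "inputs", "data")
first  <- setup_input_sketch(demo.folder)  # creates all folders
second <- setup_input_sketch(demo.folder)  # nothing left to create
length(first)
length(second)
```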
The functions take advantage of the structure of the file.system to automate loading and linking of data and metadata through appropriate naming and location of files within the file.system.
raw/: organise all raw data here.
pre/: save copies of the raw data files in the appropriate data (csv) or meta.vars folders.
NB meta.var data sheets should have the same name as the corresponding data sheet. Take care during this stage to ensure files are named correctly and stored in the appropriate folders.
The fcodes vector specifies the details of folders in pre/ and post/ input.folder folders. It also creates appropriate code prefixes for each type of data or meta.var sheet. Note that “D” is reserved for data files, “R” for ref files and “N” for n files.
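One way such prefixes could be derived is sketched below. This is an assumption for illustration, not the code of ensure_fcodes(), and make_fcodes_sketch is a hypothetical name. The reserved prefixes are assigned first. Each remaining meta.var then gets the shortest upper-case prefix of its name that is still unused.

```r
# Derive folder code prefixes: "D", "R" and "N" are reserved, other meta.vars
# get the shortest unused upper-case prefix of their own name.
# NOTE: hypothetical sketch, not the rmacroRDM implementation.
make_fcodes_sketch <- function(meta.vars) {
  reserved <- c(D = "csv", R = "ref", N = "n")
  # the data folder is always present; ref and n only if configured
  codes <- reserved[reserved %in% c("csv", meta.vars)]
  for (v in setdiff(meta.vars, reserved)) {
    prefix <- NULL
    for (k in seq_len(nchar(v))) {
      candidate <- toupper(substr(v, 1, k))
      if (!candidate %in% names(codes)) {
        prefix <- candidate
        break
      }
    }
    if (is.null(prefix)) stop("no free prefix for meta.var: ", v)
    codes[prefix] <- v
  }
  codes
}

fcodes_sketch <- make_fcodes_sketch(c("qc", "observer", "ref", "n", "notes"))
print(fcodes_sketch)
```

With the default meta.vars, "notes" cannot take "N" because it is reserved for n, so it falls back to the two-letter prefix "NO".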
fcodes <- ensure_fcodes(meta.vars)
print(fcodes)
D R N Q O NO
"csv" "ref" "n" "qc" "observer" "notes"
Specify the file.names of the files you wish to process. If you want to process all files in the file.system use file.names = NULL.
file.names <- create_file.names(file.names = c("brainmain2.csv",
"Amniote_Database_Aug_2015.csv", "anagedatasetf.csv"))
dcodes succesfully extracted from 'data_log.csv'
print(file.names)
x0 x1
"brainmain2.csv" "Amniote_Database_Aug_2015.csv"
x2
"anagedatasetf.csv"
Load the system reference (sys.ref) files required for data processing.
load_sys.ref(fileEncoding = "mac", view = F)
loading:
metadata
data_log
vnames
attaching to env: sys.ref
NULL
*use view = T to open a viewer for each of the sys.ref files on load.
metadata.csv
kable(head(metadata, 10))
| code | cat | descr | scores | levels | type | units |
|---|---|---|---|---|---|---|
| species | NOMINAL | Scientific name | NA | NA | NA | NA |
| brain.volume | MORPHOLOGICAL | Brain volume | NA | NA | CON | mm |
| brain.mass | MORPHOLOGICAL | Brain mass | NA | NA | CON | g |
| male.brain.mass | MORPHOLOGICAL | Male brain mass | NA | NA | CON | g |
| female.brain.mass | MORPHOLOGICAL | Female brain mass | NA | NA | CON | g |
| body.mass | MORPHOLOGICAL | Body mass | NA | NA | CON | g |
| male.body.mass | MORPHOLOGICAL | Male body mass | NA | NA | CON | g |
| female.body.mass | MORPHOLOGICAL | Female body mass | NA | NA | CON | g |
| telencephalic.volume.fraction | MORPHOLOGICAL | Telencephalic volume fraction | NA | NA | CON | mm |
| female.maturity | LIFE HISTORY TRAIT | Female maturity | NA | NA | INT | d |
data_log.csv
kable(data_log)
| dcode | file.name | descr | source | source.contact | method | notes |
|---|---|---|---|---|---|---|
| D0 | brainmain2.csv | NA | NA | NA | NA | NA |
| D1 | Amniote_Database_Aug_2015.csv | NA | NA | NA | NA | NA |
| D2 | anagedatasetf.csv | NA | NA | NA | NA | NA |
| D3 | lifehistraits.csv | NA | NA | NA | NA | NA |
| D4 | parentalcare.csv | NA | NA | NA | NA | NA |
vnames.csv
kable(head(vnames, 10))
| code | D0 | D1 | D2 | D3 | D4 | R1 | N1 |
|---|---|---|---|---|---|---|---|
| species | Scientific name | species | Scientific.name | Scientific name | Scientific_name | species | species |
| genus | NA | genus | NA | NA | NA | genus | genus |
| brain.volume_n | n | NA | Sample.size | NA | NA | NA | NA |
| brain.volume | Brain Vol. | NA | NA | NA | NA | NA | NA |
| brain.volume_ref | Source brain vol. | NA | NA | NA | NA | NA | NA |
| brain.mass | Brain mass (g) | NA | NA | NA | NA | NA | NA |
| brain.mass_n | n brain mass | NA | NA | NA | NA | NA | NA |
| brain.mass_ref | source brain mass | NA | NA | NA | NA | NA | NA |
| male.brain.mass | male brain mass | NA | NA | NA | NA | NA | NA |
| male.brain.mass_n | male brain mass no | NA | NA | NA | NA | NA | NA |
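The vnames table maps the column names used in each file (one column per dcode) to the master variable codes. The sketch below shows how such a lookup could be applied to rename the columns of a raw data sheet. The helper rename_to_codes is hypothetical, and the numeric values are toy values made up for illustration. Only the name mappings are taken from the table above.

```r
# Three rows of vnames for dataset D0, as shown in the table above
vnames_demo <- data.frame(
  code = c("species", "brain.volume", "brain.volume_n"),
  D0   = c("Scientific name", "Brain Vol.", "n"),
  stringsAsFactors = FALSE
)

# Toy raw data sheet using the original D0 column names (values are made up)
raw <- data.frame(
  "Scientific name" = c("Aix_galericulata", "Aix_sponsa"),
  "Brain Vol."      = c(1.5, 2.5),
  "n"               = c(3, 4),
  "extra column"    = c("a", "b"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Rename columns found in vnames[[dcode]] to their master code;
# columns without a match keep their original name.
# NOTE: hypothetical helper, not part of rmacroRDM.
rename_to_codes <- function(df, vnames, dcode) {
  idx   <- match(names(df), vnames[[dcode]])
  known <- !is.na(idx)
  names(df)[known] <- vnames$code[idx[known]]
  df
}

renamed <- rename_to_codes(raw, vnames_demo, "D0")
names(renamed)
```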
syn.links.csv
Used for taxonomic matching (there are plans to automate this by integrating the taxize package; the supplied syn.links only relates to birds).
syn.links <- read.csv(text=getURL("https://raw.githubusercontent.com/annakrystalli/rmacroRDM/master/data/input/taxo/syn.links.csv",
ssl.verifypeer = FALSE), header=T)
This step processes the .csv copies of the raw data in the pre/ folder and writes the processed files as .csv to the post/ folder, ready to be matched. The processing conserves the file.names throughout. It is important for this step that the file.system is correctly populated (i.e. meta.var data sheets should have the same name as the data sheet and be in the correct folder).
The function runs a basic processing stage for each file in file.names:
file.system = "fromFS", file.names if file.names is a vector of file names (note that only files available in the file.system are processed).
c("", " ", "NA", "-999") coded as NAs by default.
vnames
"genus_species" and merged into a single species column. Ensure that the column containing genus data in each file is matched to the code genus in vnames.
custom processing scripts
You can include extra processing scripts for individual files by adding them to the {script.folder}process/ folder. To be loaded correctly, scripts need to be named appropriately:
file.name: name script as file.name, e.g. "Amniote_Database_Aug_2015.R".
file.name: name script as file.name appended with the appropriate fcode, e.g. "Amniote_Database_Aug_2015_ref.R".
process_file.system(file.names, fcodes)
[1] "all files in file system have valid vnames columns"
===================================================================
processing Amniote_Database_Aug_2015.csv D1
[1] "source complete"
sourced: process/Amniote_Database_Aug_2015.R
'genus_species' combined in species column. Genus data removed
df nrow (species) : 9802 df ncol (traits) : 20
***
===================================================================
processing anagedatasetf.csv D2
df nrow (species) : 1191 df ncol (traits) : 18
***
===================================================================
processing brainmain2.csv D0
duplicate species name in D0
df nrow (species) : 2586 df ncol (traits) : 24
***
===================================================================
processing Amniote_Database_Aug_2015.csv R1
[1] "source complete"
sourced: process/Amniote_Database_Aug_2015.R
'genus_species' combined in species column. Genus data removed
df nrow (species) : 9802 df ncol (traits) : 20
***
===================================================================
processing Amniote_Database_Aug_2015.csv N1
[1] "source complete"
sourced: process/Amniote_Database_Aug_2015.R
'genus_species' combined in species column. Genus data removed
df nrow (species) : 9802 df ncol (traits) : 19
***
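Two of the basic processing steps described above are the recoding of missing values and the merge of genus and species into a single "genus_species" column. They can be sketched in base R as below. This is an illustration under assumptions, not the code of process_file.system(). The inline CSV is a toy example and its body.mass value is made up.

```r
# Toy CSV with separate genus and species columns and coded missing values
csv_text <- paste(
  "genus,species,body.mass",
  "Aix,sponsa,-999",
  "Aix,galericulata,",
  "Alopochen,aegyptiaca,1900",
  sep = "\n"
)

# Read with the default NA codes listed above
df <- read.csv(text = csv_text,
               na.strings = c("", " ", "NA", "-999"),
               stringsAsFactors = FALSE)

# Combine genus and species into a single species column, then drop genus
df$species <- paste(df$genus, df$species, sep = "_")
df$genus <- NULL

print(df)
```

Here the -999 code and the empty cell both become NA, mirroring the 'genus_species' message in the log above.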
Use spp.list_src to specify the dcode of the file.name to extract the spp.list from. Otherwise, supply a vector of species names to species.
spp.list <- createSpp.list(species = NULL,
taxo.dat = NULL,
spp.list_src = spp.list_src)
species list extracted from dataset: D0
file.name: brainmain2.csv
str(spp.list, vec.len = 3)
'data.frame': 1906 obs. of 4 variables:
$ species : chr "Aix_galericulata" "Aix_sponsa" "Alopochen_aegyptiaca" ...
$ master.spp : logi TRUE TRUE TRUE TRUE ...
$ rel.spp : logi NA NA NA NA ...
$ taxo.status: chr "original" "original" "original" ...
- attr(*, "type")= chr "spp.list"
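The structure printed above can be reproduced from a plain vector of species names. The sketch below is a hedged illustration of that structure only, not the code of createSpp.list(), and make_spp_list_sketch is a hypothetical name. Removing duplicates is an assumption, suggested by the duplicate species warning for D0 in the processing log.

```r
# Build a spp.list-like data frame from a vector of species names.
# NOTE: hypothetical sketch, not the rmacroRDM implementation.
make_spp_list_sketch <- function(species) {
  spp <- data.frame(species     = unique(species),  # assumed de-duplication
                    master.spp  = TRUE,
                    rel.spp     = NA,
                    taxo.status = "original",
                    stringsAsFactors = FALSE)
  attr(spp, "type") <- "spp.list"
  spp
}

spp_demo <- make_spp_list_sketch(c("Aix_galericulata", "Aix_sponsa",
                                   "Aix_sponsa", "Alopochen_aegyptiaca"))
str(spp_demo, vec.len = 3)
```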
master <- create_master(spp.list)
str(master, max.level = 2, vec.len = 3)
List of 3
$ data :'data.frame': 0 obs. of 11 variables:
..$ species : logi(0)
..$ synonyms : logi(0)
..$ data.status: logi(0)
..$ var : logi(0)
..$ value : logi(0)
..$ data.ID : logi(0)
..$ qc : logi(0)
..$ observer : logi(0)
..$ ref : logi(0)
..$ n : logi(0)
..$ notes : logi(0)
..- attr(*, "format")= chr "master"
$ spp.list:'data.frame': 1906 obs. of 4 variables:
..$ species : chr [1:1906] "Aix_galericulata" "Aix_sponsa" "Alopochen_aegyptiaca" ...
..$ master.spp : logi [1:1906] TRUE TRUE TRUE TRUE ...
..$ rel.spp : logi [1:1906] NA NA NA NA ...
..$ taxo.status: chr [1:1906] "original" "original" "original" ...
..- attr(*, "type")= chr "spp.list"
$ metadata:'data.frame': 38 obs. of 7 variables:
..$ code : chr [1:38] "species" "brain.volume" "brain.mass" ...
..$ cat : chr [1:38] "NOMINAL" "MORPHOLOGICAL" "MORPHOLOGICAL" ...
..$ descr : chr [1:38] "Scientific name" "Brain volume" "Brain mass" ...
..$ scores: chr [1:38] NA NA NA ...
..$ levels: chr [1:38] NA NA NA ...
..$ type : chr [1:38] NA "CON" "CON" ...
..$ units : chr [1:38] NA "mm" "g" ...
- attr(*, "type")= chr "master"
- attr(*, "file.names")= chr "empty"
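The column order of the empty master data frame matches species followed by match.vars, var.vars and meta.vars from the configuration. A minimal sketch of building such an empty frame is given below, assuming that order. It is not the code of create_master().

```r
# Configuration values as set with init_db() above
match.vars <- c("synonyms", "data.status")
var.vars   <- c("var", "value", "data.ID")
meta.vars  <- c("qc", "observer", "ref", "n", "notes")

# Column order assumed from the str() output above
master_cols <- c("species", match.vars, var.vars, meta.vars)

# Zero-row data frame with one empty logical column per master variable
empty_data <- as.data.frame(
  setNames(replicate(length(master_cols), logical(0), simplify = FALSE),
           master_cols)
)
attr(empty_data, "format") <- "master"

str(empty_data)
```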
The function loads the file specified by filename in input.folder/pre/csv/. The argument sub specifies which of the two sets of species to be matched (spp.list or data) is a subset of (i.e. smaller than) the other. spp.list is the spp.list attached to the master.
filename <- file.names[file.names == "Amniote_Database_Aug_2015.csv"]
m <- matchObj(file.name = filename,
spp.list = master$spp.list,
sub = "spp.list") # use addMeta function to manually add metadata.
List of 7
$ data.ID : chr "D1"
$ data :'data.frame': 9802 obs. of 20 variables:
..- attr(*, "format")= chr "data:wide"
$ spp.list :'data.frame': 1906 obs. of 4 variables:
..- attr(*, "type")= chr "spp.list"
$ sub : chr "spp.list"
$ set : chr "data"
$ meta :List of 5
..- attr(*, "type")= chr "meta"
$ file.name: Named chr "Amniote_Database_Aug_2015.csv"
..- attr(*, "names")= chr "x1"
- attr(*, "type")= chr "match.object"
- attr(*, "status")= chr "unmatched"
m <- m %>%
separateDatMeta() %>%
compileMeta(input.folder = input.folder) %>%
checkVarMeta(master$metadata) %>%
dataMatchPrep()
[1] "============================================================================"
[1] "processing meta.var: qc"
[1] "Warning: NULL data for meta.var: qc"
[1] "============================================================================"
[1] "processing meta.var: observer"
[1] "Warning: NULL data for meta.var: observer"
[1] "============================================================================"
[1] "processing meta.var: ref"
[1] "loading 'ref/Amniote_Database_Aug_2015.csv'"
[1] "============================================================================"
[1] "processing meta.var: n"
[1] "loading 'n/Amniote_Database_Aug_2015.csv'"
[1] "n vars matched successfully to post/n/Amniote_Database_Aug_2015_n_group.csv"
[1] "============================================================================"
[1] "processing meta.var: notes"
[1] "Warning: NULL data for meta.var: notes"
[1] "D1 metadata complete"
List of 7
$ data.ID : chr "D1"
$ data :'data.frame': 9802 obs. of 22 variables:
$ spp.list :'data.frame': 1906 obs. of 4 variables:
..- attr(*, "type")= chr "spp.list"
$ sub : chr "spp.list"
$ set : chr "data"
$ meta :List of 5
..- attr(*, "type")= chr "meta"
$ file.name: Named chr "Amniote_Database_Aug_2015.csv"
..- attr(*, "names")= chr "x1"
- attr(*, "type")= chr "match.object"
- attr(*, "status")= chr "unmatched"
m <- dataSppMatch(m, syn.links = syn.links, addSpp = T)
Warning in dataSppMatch(m, syn.links = syn.links, addSpp = T): match
incomplete, 8 spp.list datapoints unmatched
List of 8
$ data.ID : chr "D1"
$ data :'data.frame': 1898 obs. of 22 variables:
..- attr(*, "format")= chr "data:wide"
$ spp.list :'data.frame': 1906 obs. of 4 variables:
..- attr(*, "type")= chr "spp.list"
$ sub : chr "spp.list"
$ set : chr "data"
$ meta :List of 5
..- attr(*, "type")= chr "meta"
$ file.name: Named chr "Amniote_Database_Aug_2015.csv"
..- attr(*, "names")= chr "x1"
$ unmatched:'data.frame': 8 obs. of 2 variables:
- attr(*, "type")= chr "match.object"
- attr(*, "status")= chr "matched:incomplete - 8 spp.list unmatched"
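The idea of matching through synonyms can be sketched as follows: try a direct name match first, then fall back to a synonym lookup, and report what remains unmatched. This is not the algorithm of dataSppMatch(). The two-column layout assumed for syn.links, the status labels, the helper name and the "Species_*" names are all hypothetical.

```r
# Match master species to the species in a dataset, directly or via synonyms.
# NOTE: hypothetical sketch; syn.links is assumed to hold pairs of
# synonymous names in its first two columns.
match_species_sketch <- function(spp, data_spp, syn.links) {
  matched <- ifelse(spp %in% data_spp, spp, NA_character_)
  for (i in which(is.na(matched))) {
    # look the name up in both columns of the synonym table
    syns <- c(syn.links[[2]][syn.links[[1]] == spp[i]],
              syn.links[[1]][syn.links[[2]] == spp[i]])
    hit <- syns[syns %in% data_spp]
    if (length(hit) > 0) matched[i] <- hit[1]
  }
  status <- ifelse(is.na(matched), "unmatched",
                   ifelse(matched == spp, "original", "synonym"))
  data.frame(species = spp, synonyms = matched, data.status = status,
             stringsAsFactors = FALSE)
}

# Toy inputs (the "Species_*" names are placeholders, not real taxa)
spp      <- c("Aix_sponsa", "Species_a", "Species_b")
data_spp <- c("Aix_sponsa", "Species_a_old")
syn_demo <- data.frame(species  = "Species_a",
                       synonyms = "Species_a_old",
                       stringsAsFactors = FALSE)

res <- match_species_sketch(spp, data_spp, syn_demo)
print(res)
```

Species left as "unmatched" correspond to the kind of entries reported in the unmatched element of the match object above.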
output <- masterDataFormat(m, meta.vars, match.vars, var.vars)
Warning in masterDataFormat(m, meta.vars, match.vars, var.vars): 1575 data points missing reference information!:
(9.6% of 16469)
str(output, max.level = 1, vec.len = 3)
List of 3
$ data :'data.frame': 16469 obs. of 11 variables:
..- attr(*, "format")= chr "master"
$ spp.list :'data.frame': 1906 obs. of 4 variables:
..- attr(*, "type")= chr "spp.list"
$ file.name: Named chr "Amniote_Database_Aug_2015.csv"
..- attr(*, "names")= chr "D1"
- attr(*, "type")= chr "data:master"
- attr(*, "status")= chr "matched"
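The output above shows the wide data (one row per species) turned into the long master format (one row per observation), which is why 1898 matched species yield 16469 data points. A minimal base R sketch of such a wide-to-long conversion is given below. It is an illustration, not the code of masterDataFormat(). It leaves out the metadata columns, and the trait values are toy values.

```r
# Toy wide data: one row per species, one column per trait (values made up)
wide <- data.frame(species    = c("Aix_galericulata", "Aix_sponsa"),
                   body.mass  = c(1, NA),
                   brain.mass = c(2, 3),
                   stringsAsFactors = FALSE)

# Stack every trait column into species / var / value / data.ID rows and
# drop missing values, so each remaining row is one observation.
# NOTE: hypothetical helper, not part of rmacroRDM.
wide_to_long_sketch <- function(wide, data.ID) {
  traits <- setdiff(names(wide), "species")
  long <- do.call(rbind, lapply(traits, function(v) {
    data.frame(species = wide$species,
               var     = v,
               value   = as.character(wide[[v]]),
               data.ID = data.ID,
               stringsAsFactors = FALSE)
  }))
  long <- long[!is.na(long$value), ]
  rownames(long) <- NULL
  long
}

long <- wide_to_long_sketch(wide, data.ID = "D1")
print(long)
```

Values are stored as character here so that numeric and categorical traits can share one value column. That is a design assumption of the sketch.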
master <- updateMaster(master, output = output)
str(master, max.level = 1, vec.len = 3)
List of 3
$ data :'data.frame': 16469 obs. of 11 variables:
..- attr(*, "format")= chr "master"
$ spp.list:'data.frame': 1906 obs. of 4 variables:
..- attr(*, "type")= chr "spp.list"
$ metadata:'data.frame': 38 obs. of 7 variables:
- attr(*, "type")= chr "master"
- attr(*, "file.names")= Named chr "Amniote_Database_Aug_2015.csv"
..- attr(*, "names")= chr "D1"
jsonedit(master)
jsonedit(m)